13 research outputs found

    Accelerating sequential programs using FastFlow and self-offloading

    FastFlow is a programming environment specifically targeting cache-coherent shared-memory multi-cores. FastFlow is implemented as a stack of C++ template libraries built on top of lock-free (fence-free) synchronization mechanisms. In this paper we present a further evolution of FastFlow that enables programmers to offload part of their workload onto a dynamically created software accelerator running on unused CPUs. The offloaded function can be easily derived from pre-existing sequential code. We emphasize in particular the effective trade-off between human productivity and execution efficiency of the approach. Comment: 17 pages + cove

    An Optimization Theory for Structured Stencil-based Parallel Applications

    In this thesis, we introduce a new optimization theory for stencil-based applications which is centered both on a modification of the well-known owner-computes rule and on basic but powerful properties of toroidal spaces. The proposed optimization techniques provide notable results in different computational aspects: from the reduction of communication overhead to the reduction of computation time, through the minimization of memory requirements without performance loss. All classical optimization theory is based on defining transformations that can produce optimized programs which are computationally equivalent to the original ones. According to Kennedy, two programs are equivalent if, from the same input data, they produce identical output data. Like other proposed modifications to the owner-computes rule, we exploit the fact that stencil applications are characterized by a set of consecutive steps. For such configurations, it is possible to define specific two-phase optimizations. The first phase is characterized by the application of program transformations which result in an efficient computation of an output that can be easily converted into the original one. In other words, the transformed program defined by the first phase is not computationally equivalent to the original one. The second phase converts the output of the previous phase back into the original one, exploiting optimized techniques in order to introduce the lowest additional overhead. This phase guarantees the computational equivalence of the approach. Obviously, in order to define an interesting new optimization technique, we have to prove that the overall performance of the two-phase sequence is better than that of the original program. Exploiting a structured approach and studying this optimization theory on stencils featuring specific patterns of functional dependencies, we discover a set of novel transformations which result in significant optimizations.
    Among the new transformations, the most notable one, which aims to reduce the number of communications necessary to implement a stencil-based application, turns out to be the best optimization technique amongst those cited in the literature. All the improvements provided by the transformations presented in this thesis have been both formally proved and experimentally tested on a heterogeneous set of architectures including clusters and different types of multi-cores.

    FastFlow: Efficient Parallel Streaming Applications on Multi-core

    Shared memory multiprocessors have come back to popularity thanks to the rapid spread of commodity multi-core architectures. As ever, shared memory programs are fairly easy to write and quite hard to optimise; providing multi-core programmers with optimising tools and programming frameworks is a current challenge. Few efforts have been made to support effective streaming applications on these architectures. In this paper we introduce FastFlow, a low-level programming framework based on lock-free queues explicitly designed to support high-level languages for streaming applications. We compare FastFlow with state-of-the-art programming frameworks such as Cilk, OpenMP, and Intel TBB. We experimentally demonstrate that FastFlow is always more efficient than all of them in a set of micro-benchmarks and on a real world application; the speedup edge of FastFlow over other solutions can be substantial for fine grain tasks, for example +35% over OpenMP, +226% over Cilk, and +96% over TBB for the alignment of protein P01111 against UniProt DB using the Smith-Waterman algorithm. Comment: 23 pages + cove

    Minimizing Communications with Q-transformations in Uniform and Affine Stencils

    In stencil-based parallel applications, communications represent the main overhead, especially when targeting a fine grain parallelization in order to reduce the completion time. Techniques that minimize the number and the impact of communications are clearly relevant. In the literature, the best optimization reduces the number of communications per step from 3^dim, featured by a naive implementation, to 2*dim, where dim is the number of domain dimensions. To break this bound, in the paper we introduce and formally prove Q-transformations, for stencils featuring data dependencies that can be expressed as geometric affine translations. Q-transformations, based on data dependency orientations through space translations, lower the number of communications per step to dim.
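
To make the bounds quoted in this abstract concrete, the short sketch below tabulates the three per-step communication counts for a few values of dim. It assumes the naive count reads as 3 raised to the power dim (the exponent appears to have been flattened in the listing); the function name `communications_per_step` and the dictionary keys are hypothetical labels chosen for illustration, not terms from the paper.

```python
def communications_per_step(dim: int) -> dict[str, int]:
    """Per-step communication counts as quoted in the abstract.

    Assumption: the naive figure is read as 3**dim. The key names are
    illustrative labels only.
    """
    if dim < 1:
        raise ValueError("dim must be a positive integer")
    return {
        "naive": 3 ** dim,         # naive implementation
        "best_prior": 2 * dim,     # best optimization cited in the literature
        "q_transformation": dim,   # bound claimed for Q-transformations
    }


if __name__ == "__main__":
    # Print one row per dimensionality so the gap between bounds is visible.
    for dim in (1, 2, 3, 4):
        counts = communications_per_step(dim)
        print(dim, counts["naive"], counts["best_prior"], counts["q_transformation"])
```

Under these figures the naive count grows exponentially with dim while the other two grow linearly, so the gap widens quickly as the number of domain dimensions increases.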

    Efficient Smith-Waterman on multi-core with FastFlow

    Abstract: Shared memory multiprocessors have returned to popularity thanks to the rapid spread of commodity multi-core architectures. However, little attention has been paid to supporting effective streaming applications on these architectures. In this paper we describe FastFlow, a low-level programming framework based on lock-free queues explicitly designed to support high-level languages for streaming applications. We compare FastFlow with state-of-the-art programming frameworks such as Cilk, OpenMP, and Intel TBB. We experimentally demonstrate that FastFlow is always more efficient than them on a given real world application: the speedup of FastFlow over other solutions may be substantial for fine grain tasks, for example +35% over OpenMP, +226% over Cilk, and +96% over TBB for the alignment of protein P01111 against UniProt DB using the Smith-Waterman algorithm.